Feat: Don't retry function upload on 400 or 422 status #478

Merged
jenae-janzen merged 21 commits into master from erikw/lambda-error-msg on Sep 22, 2023

Conversation

@erikw (Contributor) commented Jul 20, 2023

# WIP

Don't retry function uploads that fail with a 4xx response.

When a function upload receives a 400 or 422 status code, we know that retrying the upload will fail. This change returns early rather than retrying.
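Roughly, the intended rule looks like the following (a minimal sketch only; the helper name is illustrative, the actual change lives in go/porcelain/deploy.go):

```go
package main

import "fmt"

// isPermanentUploadFailure sketches the rule described above: a 400 or 422
// response means retrying the upload cannot succeed, so we should bail out.
// (Helper name is illustrative, not part of the actual patch.)
func isPermanentUploadFailure(status int) bool {
	return status == 400 || status == 422
}

func main() {
	for _, code := range []int{400, 422, 429, 500} {
		fmt.Printf("status %d -> retry: %v\n", code, !isPermanentUploadFailure(code))
	}
}
```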

@netlify netlify bot commented Jul 20, 2023

Deploy Preview for open-api ready!

🔨 Latest commit: 25f5b63
🔍 Latest deploy log: https://app.netlify.com/sites/open-api/deploys/650ad802998da00008fed3bd
😎 Deploy Preview: https://deploy-preview-478--open-api.netlify.app

go/porcelain/deploy.go (resolved, outdated)
@JGAntunes (Contributor) left a comment

The direction looks right 👍

Something I would suggest we add is a property (a boolean flag of some sort) that we can use to enable this new behaviour, so we can feature-flag it on Buildbot's side.

}

// TODO this changes retry behaviour for the fileUpload case as well. OK?
if apiErr.Code()/100 == 4 {

I would voice the same concern that @biruwon pointed out here. I'm concerned that skipping the retry for all 4xx calls might be a stretch (i.e. shouldn't we still retry 429s?).


Updated to just skip the retry for 400 and 422 status codes 🙂

@jenae-janzen (Contributor) commented:

> Something I would suggest we add is a property (a boolean flag of some sort) that we can use to enable this new behaviour, so we can feature-flag it on Buildbot's side.

I added this to the options struct, but am also open to passing it in as a parameter. I started adding it in Buildbot here, but it's not set up quite right yet and I haven't set up the flag in DevCycle yet.
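For illustration, a minimal sketch of what the flag on the options struct could look like (struct and field names here are illustrative, not the final API):

```go
package porcelain // sketch only

// DeployOptions (illustrative) carries the new flag so Buildbot can toggle
// the behaviour via a feature flag on its side.
type DeployOptions struct {
	// SkipRetry, when true, stops retrying function uploads that fail with
	// a 400 or 422 response. It defaults to false, preserving the current
	// retry behaviour when the feature flag is unset.
	SkipRetry bool
}
```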

@jenae-janzen jenae-janzen marked this pull request as ready for review August 24, 2023 18:35
@jenae-janzen jenae-janzen requested a review from a team as a code owner August 24, 2023 18:35
@jenae-janzen jenae-janzen changed the title from "Don't retry function upload on 4xx" to "Feat: Don't retry function upload on 400 or 422 status" Aug 24, 2023
@jenae-janzen (Contributor) commented:

This is ready for review. We set skipRetry to false by default if no value is received for the feature flag. The accompanying change is also ready for review, but depends on this change going out first because this is where options is defined.

go/porcelain/deploy.go (resolved, outdated)
go/porcelain/deploy.go (resolved)
@JGAntunes (Contributor) left a comment

Left some comments, let me know what you all think 👍

sharedErr.mutex.Lock()
sharedErr.err = operationError
sharedErr.mutex.Unlock()
return nil

So I had to spend some time looking through this, but I believe there's some room for improvement here (in the retry strategy, I mean 😅).

I was initially going to say that we didn't want to return nil here, but then I understood that the idea is to actually exit the retry early. We have a check at the beginning of the retry method where we error out if sharedErr is set:

    if sharedErr.err != nil {
        sharedErr.mutex.Unlock()
        return fmt.Errorf("aborting upload of file %s due to failed upload of another file", f.Name)
    }

However, this case is also not ideal, because it just means we'll keep retrying and erroring out here until we exhaust the count.

I was reading through the docs for backoff - https://pkg.go.dev/github.com/cenkalti/backoff/v4#Retry - and it seems like they have support for a PermanentError - https://pkg.go.dev/github.com/cenkalti/backoff/v4#PermanentError - which is a way for us to exit the retry loop while keeping the "error semantics" correct. It seems like this would be the perfect fit for this case? What do you think?
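For reference, a minimal runnable sketch of how returning a PermanentError exits the retry loop (assuming github.com/cenkalti/backoff/v4; the upload function and error here are placeholders, not the real porcelain code):

```go
package main

import (
	"errors"
	"fmt"

	"github.com/cenkalti/backoff/v4"
)

// errUnprocessable stands in for a 422 API error in this sketch.
var errUnprocessable = errors.New("422 Unprocessable Entity")

// upload is a placeholder for the real file/function upload call.
func upload() error { return errUnprocessable }

func main() {
	attempts := 0
	operation := func() error {
		attempts++
		err := upload()
		if err == nil {
			return nil
		}
		if errors.Is(err, errUnprocessable) {
			// Wrapping the error with backoff.Permanent makes Retry stop and
			// return immediately instead of scheduling another attempt.
			return backoff.Permanent(err)
		}
		return err // retryable: Retry will call operation again
	}

	err := backoff.Retry(operation, backoff.WithMaxRetries(backoff.NewExponentialBackOff(), 5))
	fmt.Printf("attempts=%d err=%v\n", attempts, err) // attempts=1
}
```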

@@ -287,7 +287,7 @@ func TestUploadFiles_Cancelation(t *testing.T) {
for _, bundle := range files.Files {
d.Required = append(d.Required, bundle.Sum)
}
err = client.uploadFiles(ctx, d, files, nil, fileUpload, time.Minute)
err = client.uploadFiles(ctx, d, files, nil, fileUpload, time.Minute, false)

Wondering if we could add a test here which exercises the new code paths we introduced (i.e. checking we don't retry 400 and 422 errors when uploading).


@JGAntunes do you have any pointers for the testing? I seem to be a bit stuck here. We are also having trouble running the test suite locally, which is slowing this down a bit 😬


About running the test suite locally: if it seems like nothing is running, it's probably because of the retry loop.

Try running the tests as:

    go test -race github.com/netlify/open-api/v2/go/porcelain --run TestUploadFiles400Errors -v

And you should get some output (hopefully).


Thanks for the tip @4xposed!! We got the tests running and can see the output which has been very helpful :D

sharedErr := &uploadError{err: nil, mutex: &sync.Mutex{}}
permanentErr := &backoff.PermanentError{Err: nil}

It's a bit unclear to me why we instantiate permanentErr here, could you elaborate?

As far as I know, backoff.PermanentError is meant to be returned from within a backoff.Retry() block.


Initially we weren't sure how to use PermanentError, but your other comment pointed us in the right direction I think

}

if skipRetry && (apiErr.Code() == 400 || apiErr.Code() == 422) {
operationError = permanentErr

I might be wrong, but I think we should wrap the existing error in a backoff.PermanentError{}, for which backoff provides the Permanent() function:

Suggested change:
-operationError = permanentErr
+operationError = backoff.Permanent(operationError)


Ah, I think that does make sense. And then when backoff receives the Permanent error, it should stop retrying, if I understand correctly.

ctx := gocontext.Background()

server := httptest.NewServer(http.HandlerFunc(func(rw http.ResponseWriter, _ *http.Request) {
rw.WriteHeader(http.StatusUnprocessableEntity)
@4xposed (Contributor) commented Aug 31, 2023

I think to properly assert the error message in the require.Equal on line 350, we need to make the mock server return a JSON body with code and message keys:

Suggested change:
-rw.WriteHeader(http.StatusUnprocessableEntity)
+rw.WriteHeader(http.StatusUnprocessableEntity)
+rw.Header().Set("Content-Type", "application/json; charset=utf-8")
+rw.Write([]byte(`{"message": "Unprocessable Entity", "code": 422}`))


This is super helpful and definitely makes sense! Looking at other test cases, I see this is done similarly. However, when we set rw.Write([]byte(`{"message": "Unprocessable Entity", "code": 422}`)) we keep getting the error &{0 } (*models.Error) is not supported by the TextConsumer, can be resolved by supporting TextUnmarshaler interface.

So the apiError isn't set, and our test skips over the code we're trying to test and keeps retrying 🙈

Looking at examples in other tests, it seems like setting the response this way works when calling client.Operations.UploadDeployFile, but not when calling client.uploadFiles -- when calling uploadFiles the error is thrown from within UploadDeployFile.

We've been trying to debug this by writing the body in different ways, trying to marshal the response, etc., but haven't had any success yet 🤔 Is there maybe something silly we're missing?

I think it might be the order in which those are set; the mock server is a bit quirky at times. Try setting them in this order (WriteHeader sends the headers, so anything set on the header map after it is ignored):

rw.Header().Set("Content-Type", "application/json; charset=utf-8")
rw.WriteHeader(http.StatusUnprocessableEntity)
rw.Write([]byte(`{"message": "Unprocessable Entity", "code": 422 }`))


👋 @4xposed thanks so much for the tip! That worked, and as far as I can tell the server is now responding as expected. However, we're hitting a different problem and struggling to debug it.

We now receive the error with the correct code and message, but when we get to this line:

    apiErr, ok := operationError.(apierrors.Error)

no matter what we try, apiErr is always nil and ok is always false.

operationError is an error that looks like:

    error(*github.com/netlify/open-api/v2/go/plumbing/operations.UploadDeployFunctionDefault) *{_statusCode: 422, Payload: *github.com/netlify/open-api/v2/go/models.Error {Code: 422, Message: "Unprocessable Entity"}}

In my debugger, I can access the status code via either operationError._statusCode or operationError.Payload.Code, but not via operationError.(apierrors.Error). However, I can't call operationError.Payload.Code in the actual code, because Payload/Code isn't defined on the Error type.

So, since operationError.(apierrors.Error) never seems to evaluate to anything and there wasn't a test for the original code, we're uncertain whether we're doing something wrong here or whether the original code was actually working as intended.

Are we missing something here? Is there a better way to test this (perhaps non-locally)?

cc @jrwhitmer @JGAntunes
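For what it's worth, a self-contained sketch of the difference between an interface type assertion and errors.As against the concrete type; uploadDefault below is a hypothetical stand-in for the generated operations.UploadDeployFunctionDefault shown in the error dump above:

```go
package main

import (
	"errors"
	"fmt"
)

// uploadDefault is a hypothetical stand-in for the generated
// operations.UploadDeployFunctionDefault from the error dump above.
type uploadDefault struct{ statusCode int }

func (e *uploadDefault) Error() string { return fmt.Sprintf("upload failed with status %d", e.statusCode) }
func (e *uploadDefault) Code() int     { return e.statusCode }

func main() {
	var operationError error = &uploadDefault{statusCode: 422}

	// errors.As targets the concrete type, so it recovers the status code even
	// when an interface assertion like operationError.(apierrors.Error) does
	// not match the type's method set.
	var dflt *uploadDefault
	if errors.As(operationError, &dflt) {
		fmt.Println(dflt.Code()) // 422
	}
}
```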


@4xposed Thanks, I really appreciate the offer! I'm in Europe now 🎉

Testing manually using the debugger, it looks like everything's working as expected; thanks so much for the fix. The one last thing I'm trying to do to wrap up this PR is to fix my tests:

require.Equal(t, err, "[PUT /deploys/{deploy_id}/files/{path}][422] uploadDeployFile default  &{Code:422 Message: Unprocessable Entity}")

doesn't work because the error's now wrapped by the backoff.Permanent error. Is it sufficient to just require.Error(t, err) here? Or is there a better way to assert this that I'm missing?


Pairing might not be necessary since most of this seems to be working, but I'd appreciate you taking another look when you have the time 🙂


We could call err.Unwrap(); it seems like backoff.PermanentError has a few methods for this: https://github.com/cenkalti/backoff/blob/v4/retry.go#L129

I don't think it's strictly necessary, but it would be nice to at least have some way to verify that we have the right error and not just any error, if that makes sense.

What do you think?


I'll give it a shot! I do agree it would be nice to have the right error if possible


Okay, I used require.ErrorContains instead and that seemed to do the trick for verifying we have the correct error message.
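A minimal sketch of that assertion (the package name and error string are illustrative; require.ErrorContains matches on a substring, so the backoff.Permanent wrapping doesn't break the check):

```go
package porcelain_test // package name is arbitrary for this sketch

import (
	"errors"
	"fmt"
	"testing"

	"github.com/stretchr/testify/require"
)

func TestErrorContainsSketch(t *testing.T) {
	// Stand-in for the API error after it has been wrapped via backoff.Permanent.
	inner := errors.New("[PUT /deploys/{deploy_id}/files/{path}][422] uploadDeployFile default")
	err := fmt.Errorf("permanent: %w", inner)

	// ErrorContains asserts that err is non-nil and that its message contains
	// the given substring, so the extra wrapping does not affect the check.
	require.ErrorContains(t, err, "[422] uploadDeployFile default")
}
```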

// Set SkipRetry to false
err = client.uploadFiles(ctx, d, files, nil, fileUpload, time.Minute, false)
// require.Equal(t, err, "[PUT /deploys/{deploy_id}/files/{path}][422] uploadDeployFile default &{Code:422 Message: Unprocessable Entity}")
require.Equal(t, attempts, 12)

The number of attempts doesn't seem to be consistent, so maybe there's a better way to assert this


@4xposed this is one other tiny detail that sometimes makes the test fail - every once in a while it retries 13 times 🤔


Yes, I think (emphasis on think) it retries indefinitely until the context times out or is cancelled.

I think checking that there's more than 1 attempt (or some arbitrary number that's not too high) might be enough, maybe?

I will try to look more into the retry mechanism tomorrow morning, to figure out whether we can give the test more control over how many times we retry.


I added a check to just make sure that the attempts are greater than 1 :) That should probably be sufficient for what we're testing

@jenae-janzen jenae-janzen merged commit 9b75de7 into master Sep 22, 2023
14 checks passed
@jenae-janzen jenae-janzen deleted the erikw/lambda-error-msg branch September 22, 2023 11:55